Abstract. Phylogenetic algebraic geometry is concerned with certain complex projective
algebraic varieties derived from finite trees. Real positive points on these varieties
represent probabilistic models of evolution. For small trees, we recover classical geometric 
objects, such as toric and determinantal varieties and their secant varieties, but larger 
trees lead to new and largely unexplored territory. This paper gives a self-contained 
introduction to this subject and offers numerous open problems for algebraic geometers. 



1. Introduction 

Our title is meant as a reference to the existing branch of mathematical biology which 
is known as phylogenetic combinatorics. By "phylogenetic algebraic geometry" we mean 
the study of algebraic varieties which represent statistical models of evolution. For general 
background reading on phylogenetics we recommend the books by Felsenstein and by
Semple and Steel. They provide an excellent introduction to evolutionary trees, from the
perspectives of biology, computer science, statistics and mathematics. They also offer 
numerous references to relevant papers, in addition to the more recent ones listed below. 

Phylogenetic algebraic geometry furnishes a rich source of interesting varieties, includ- 
ing familiar ones such as toric varieties, secant varieties and determinantal varieties. But 
these are very special cases, and one quickly encounters a cornucopia of new varieties. The 
objective of this paper is to give an introduction to this subject area, aimed at students 
and researchers in algebraic geometry, and to suggest some concrete research problems. 

The basic object in a phylogenetic model is a tree T which is rooted and has n labeled 
leaves. Each node of the tree T is a random variable with k possible states (usually k is 
taken to be 2, for the binary states {0, 1}, or 4, for the nucleotides {A,C,G,T}). At the 
root, the distribution of the states is given by $\pi = (\pi_1, \ldots, \pi_k)$. On each edge e of the tree
there is a k x k transition matrix $M_e$ whose entries are indeterminates representing the
probabilities of transition (away from the root) between the states. The random variables
at the leaves are observed. The random variables at the interior nodes are hidden. Let
N be the total number of entries of the matrices $M_e$ and the vector $\pi$. These entries
are called model parameters. For instance, if T is a binary tree with n leaves then T has
2n - 2 edges, and hence $N = (2n-2)k^2 + k$. In practice, there will be many constraints
on these parameters, usually expressible in terms of linear equations and inequalities,
so the set of statistically meaningful parameters is a polyhedron P in $\mathbb{R}^N$. Sometimes,
these constraints are given by non-linear polynomials, in which case P would be a semi-algebraic
subset of $\mathbb{R}^N$. Specifying this subset P means choosing a model of evolution.
Several biologically meaningful choices of such models will be discussed in Section 3.
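
As a quick sanity check on the parameter count above, the following Python sketch evaluates $N = (2n-2)k^2 + k$ for a rooted binary tree with unconstrained transition matrices. The function name `num_parameters` is our own choice for illustration and does not come from the text.

```python
def num_parameters(n_leaves: int, k_states: int) -> int:
    """Count model parameters for a rooted binary tree with unconstrained k x k matrices."""
    n_edges = 2 * n_leaves - 2                    # a rooted binary tree with n leaves has 2n - 2 edges
    return n_edges * k_states ** 2 + k_states     # all matrix entries plus the root distribution


# Binary states on three leaves, and DNA states on five leaves.
print(num_parameters(3, 2))
print(num_parameters(5, 4))
```

For k = 2 and n = 3 this gives the 18 parameters used in the example of Section 2.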

Fix a tree T with n leaves. At each leaf we can observe k possible states, so there are
$k^n$ possible joint observations we can make at the leaves. The probability $\phi_\sigma$ of making
a particular observation $\sigma$ is a polynomial in the model parameters. Hence we get a
polynomial map whose coordinates are the polynomials $\phi_\sigma$. This map is denoted

$$\phi : \mathbb{R}^N \to \mathbb{R}^{k^n}.$$

The map $\phi$ depends only on the tree T and the number k. What we are interested in
is the image $\phi(P)$ of this map. In real-world applications, the coordinates $\phi_\sigma$ represent
probabilities, so they should be non-negative and sum to 1. In other words, the rules of
probability require that $\phi(P)$ lie in the standard $(k^n - 1)$-simplex in $\mathbb{R}^{k^n}$. In phylogenetic
algebraic geometry we temporarily abandon this requirement. We keep things simpler
and closer to the familiar setting of complex algebraic geometry, by replacing $\phi$ by its
complexification $\phi : \mathbb{C}^N \to \mathbb{C}^{k^n}$, and by replacing P and $\phi(P)$ by their Zariski closures in
$\mathbb{C}^N$ and $\mathbb{C}^{k^n}$ respectively. As we shall see, the polynomials $\phi_\sigma$ are often homogeneous and
$\phi(P)$ is best regarded as a subvariety of a projective space.

In Section 2 we give a basic example of an evolutionary model and put it squarely
in an algebraic geometric setting. This relation is then developed further in Section 3,
where we describe the main families of models and show how in special cases they lead to
familiar objects like Veronese and Segre varieties and their secant varieties. Section 4 is
concerned with the widely used Jukes-Cantor model, which is a toric variety in a suitable
coordinate system. In the last section we formulate a number of general problems in 
phylogenetic algebraic geometry that we find particularly important, and a list of more 
specific computationally oriented problems that may shed light on the more general ones. 

2. Polynomial maps derived from a tree

In this section we explain the polynomial map $\phi$ associated to a tree T and an integer
k > 1. To make things as concrete as possible, let k = 2 and T be the tree on n = 3
leaves pictured below. 



[Figure: the rooted tree T with root distribution π and leaves 1, 2, 3.]



The probability distribution at the root is an unknown vector $(\pi_0, \pi_1)$. For each of the
four edges of the tree, we have a 2 x 2-transition matrix:

$$M_a = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix}, \quad
M_b = \begin{pmatrix} b_{00} & b_{01} \\ b_{10} & b_{11} \end{pmatrix}, \quad
M_c = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}, \quad
M_d = \begin{pmatrix} d_{00} & d_{01} \\ d_{10} & d_{11} \end{pmatrix}.$$

Altogether, we have introduced 18 parameters, each of which represents a probability.
But we regard them as unknown complex numbers. The unknown $\pi_0$ represents the
probability of observing the letter 0 at the root, and the unknown $b_{01}$ represents the probability
that the letter 0 gets changed to the letter 1 along the edge b. All transitions are assumed
to be independent events, so the monomial

$$\pi_u \cdot a_{ui} \cdot b_{uv} \cdot c_{vj} \cdot d_{vk}$$

represents the probability of observing the letter u at the root, the letter v at the interior
node, the letter i at the leaf 1, the letter j at the leaf 2, and the letter k at the leaf 3. Now,
the states at the root and the interior node are hidden random variables, while the
states at the three leaves are observed. This leads us to consider the polynomial

$$\phi_{ijk} = \pi_0 a_{0i} b_{00} c_{0j} d_{0k} + \pi_0 a_{0i} b_{01} c_{1j} d_{1k} + \pi_1 a_{1i} b_{10} c_{0j} d_{0k} + \pi_1 a_{1i} b_{11} c_{1j} d_{1k}.$$

This polynomial represents the probability of observing the letter i at the leaf 1, the letter
j at the leaf 2, and the letter k at the leaf 3. The eight polynomials $\phi_{ijk}$ specify our map

$$\phi : \mathbb{C}^{18} \to \mathbb{C}^{8}.$$

In applications, where the parameters are really probabilities, one immediately replaces
$\mathbb{C}^{18}$ by a subset P, for instance, the nine-dimensional cube in $\mathbb{R}^{18}$ defined by the constraints

$$\begin{aligned}
\pi_0 + \pi_1 &= 1, & \pi_0, \pi_1 &\geq 0, \\
a_{00} + a_{01} &= 1, & a_{00}, a_{01} &\geq 0, & a_{10} + a_{11} &= 1, & a_{10}, a_{11} &\geq 0, \\
b_{00} + b_{01} &= 1, & b_{00}, b_{01} &\geq 0, & b_{10} + b_{11} &= 1, & b_{10}, b_{11} &\geq 0, \\
c_{00} + c_{01} &= 1, & c_{00}, c_{01} &\geq 0, & c_{10} + c_{11} &= 1, & c_{10}, c_{11} &\geq 0, \\
d_{00} + d_{01} &= 1, & d_{00}, d_{01} &\geq 0, & d_{10} + d_{11} &= 1, & d_{10}, d_{11} &\geq 0.
\end{aligned}$$

In phylogenetic algebraic geometry, on the other hand, we allow ourselves the luxury of
ignoring inequalities and reality issues. We regard $\phi$ as a morphism of complex varieties.

The most natural thing to do, for an algebraic geometer, is to work in a projective space.
The polynomials $\phi_{ijk}$ are homogeneous with respect to the different letters a, b, c, d and
$\pi$. We can thus change our perspective and consider our map as a projective morphism

$$\phi : \mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^1 \to \mathbb{P}^7.$$

This morphism is surjective, and it is an instructive undertaking to examine its fibers.

To underline the points made in the introduction, let us now cut down on the number
of model parameters and replace the domain of the morphism by a natural subset P. For
instance, let us define P by requiring that the four matrices are identical:

$$M_a = M_b = M_c = M_d.$$

Equivalently, $P = \mathbb{P}^3_{\mathrm{diag}} \times \mathbb{P}^1$, where $\mathbb{P}^3_{\mathrm{diag}}$ is the diagonal of $\mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^3$. The
restricted morphism $\phi|_P : \mathbb{P}^3_{\mathrm{diag}} \times \mathbb{P}^1 \to \mathbb{P}^7$ is given by the following eight polynomials:

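$$\begin{aligned}
\phi_{000} &= \pi_0 a_{00}^4 + \pi_0 a_{00} a_{01} a_{10}^2 + \pi_1 a_{10}^2 a_{00}^2 + \pi_1 a_{10}^3 a_{11} \\
\phi_{001} &= \pi_0 a_{00}^3 a_{01} + \pi_0 a_{00} a_{01} a_{10} a_{11} + \pi_1 a_{10}^2 a_{00} a_{01} + \pi_1 a_{10}^2 a_{11}^2 \\
\phi_{010} &= \pi_0 a_{00}^3 a_{01} + \pi_0 a_{00} a_{01} a_{10} a_{11} + \pi_1 a_{10}^2 a_{00} a_{01} + \pi_1 a_{10}^2 a_{11}^2 \\
\phi_{011} &= \pi_0 a_{00}^2 a_{01}^2 + \pi_0 a_{00} a_{01} a_{11}^2 + \pi_1 a_{10}^2 a_{01}^2 + \pi_1 a_{10} a_{11}^3 \\
\phi_{100} &= \pi_0 a_{00}^3 a_{01} + \pi_0 a_{01}^2 a_{10}^2 + \pi_1 a_{11} a_{10} a_{00}^2 + \pi_1 a_{10}^2 a_{11}^2 \\
\phi_{101} &= \pi_0 a_{00}^2 a_{01}^2 + \pi_0 a_{01}^2 a_{10} a_{11} + \pi_1 a_{11} a_{10} a_{00} a_{01} + \pi_1 a_{10} a_{11}^3 \\
\phi_{110} &= \pi_0 a_{00}^2 a_{01}^2 + \pi_0 a_{01}^2 a_{10} a_{11} + \pi_1 a_{11} a_{10} a_{00} a_{01} + \pi_1 a_{10} a_{11}^3 \\
\phi_{111} &= \pi_0 a_{01}^3 a_{00} + \pi_0 a_{01}^2 a_{11}^2 + \pi_1 a_{11} a_{10} a_{01}^2 + \pi_1 a_{11}^4.
\end{aligned}$$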

The image of $\phi|_P$ lies in the 5-dimensional projective subspace of $\mathbb{P}^7$ defined by $\phi_{001} =
\phi_{010}$ and $\phi_{101} = \phi_{110}$. It is a hypersurface of degree eight in this $\mathbb{P}^5$. The defining
polynomial of this hypersurface has 70 terms. Studying the geometry of this fourfold is 
a typical problem of phylogenetic algebraic geometry. For instance, what is its singular 
locus? 
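
The map of this section can be evaluated numerically by summing over the two hidden nodes. The Python sketch below is only an illustration: the function name `phi` and the sample parameter values are our own assumptions, and exact rational arithmetic is used so that equalities can be checked without rounding. It confirms that the eight coordinates sum to 1 for stochastic parameters, and that the homogeneous model satisfies the two linear relations $\phi_{001} = \phi_{010}$ and $\phi_{101} = \phi_{110}$.

```python
from fractions import Fraction
from itertools import product


def phi(pi, Ma, Mb, Mc, Md):
    """Return {(i, j, k): phi_ijk} for the three-leaf tree of this section.

    Edge a joins the root to leaf 1, edge b joins the root to the interior
    node, and edges c, d join the interior node to leaves 2 and 3.
    """
    table = {}
    for i, j, k in product(range(2), repeat=3):
        total = Fraction(0)
        # u is the hidden state at the root, v the hidden state at the interior node
        for u, v in product(range(2), repeat=2):
            total += pi[u] * Ma[u][i] * Mb[u][v] * Mc[v][j] * Md[v][k]
        table[(i, j, k)] = total
    return table


# Sample stochastic parameters (arbitrary values chosen for illustration).
pi = (Fraction(1, 3), Fraction(2, 3))
A = ((Fraction(3, 4), Fraction(1, 4)), (Fraction(1, 5), Fraction(4, 5)))

homogeneous = phi(pi, A, A, A, A)   # all four matrices identical
print(sum(homogeneous.values()))
print(homogeneous[(0, 0, 1)] == homogeneous[(0, 1, 0)])
print(homogeneous[(1, 0, 1)] == homogeneous[(1, 1, 0)])
```

A numerical check of this kind proves nothing about the ideal, but it is a cheap way to catch transcription errors in the eight polynomials.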

The definition of the map $\phi$ for an arbitrary tree T with n leaves and an arbitrary
number k of states is a straightforward generalization of the n = 3 example given above.
It is simply the calculation of the probabilities of independent events along the tree. In
general, each coordinate of the map $\phi$ is given by a polynomial of degree equal to the
number of edges of T plus one. If the root distribution is not a parameter, the degree of 
these polynomials is one less. 

One staple among the computational techniques for dealing with tree based probabilistic 
models is the sum-product algorithm. The sum-product algorithm is essentially a clever 
application of the distributive law that allows for the fast calculation of the polynomials 
$\phi_\sigma$ as well as the derivation of some polynomial relations among these. The basic idea
is to factor the polynomials that represent $\phi_\sigma$ up the tree. For instance, in our example
above with homogeneous rate matrix:

$$\phi_{000} = \pi_0 a_{00} \bigl( a_{00}(a_{00}^2) + a_{01}(a_{10}^2) \bigr) + \pi_1 a_{10} \bigl( a_{10}(a_{00}^2) + a_{11}(a_{10}^2) \bigr)$$

which can be evaluated with 10 multiplications and 3 additions instead of the initial expression,
which required 16 multiplications and 3 additions. In Section 3 we will show
how these factorizations help in identifying polynomial relations among the $\phi_\sigma$, i.e., polynomials
vanishing on the image of $\phi$.
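
The operation counts quoted above can be reproduced with a small counting wrapper. The following Python sketch is an illustration under our own naming (`Counted`, `phi000_expanded`, `phi000_factored` are not from the text): it evaluates $\phi_{000}$ for the homogeneous model once as four expanded monomials and once in the factored sum-product form, and tallies the arithmetic operations used by each.

```python
from fractions import Fraction


class Counted:
    """Number wrapper that tallies multiplications and additions."""
    mults = 0
    adds = 0

    def __init__(self, value):
        self.value = Fraction(value)

    def __mul__(self, other):
        Counted.mults += 1
        return Counted(self.value * other.value)

    def __add__(self, other):
        Counted.adds += 1
        return Counted(self.value + other.value)


def reset():
    Counted.mults = 0
    Counted.adds = 0


def phi000_expanded(pi0, pi1, a00, a01, a10, a11):
    # four monomials of degree five, multiplied out factor by factor
    t1 = pi0 * a00 * a00 * a00 * a00
    t2 = pi0 * a00 * a01 * a10 * a10
    t3 = pi1 * a10 * a10 * a00 * a00
    t4 = pi1 * a10 * a11 * a10 * a10
    return t1 + t2 + t3 + t4


def phi000_factored(pi0, pi1, a00, a01, a10, a11):
    s0 = a00 * a00                  # leaves 2 and 3 both show 0, interior node in state 0
    s1 = a10 * a10                  # same, interior node in state 1
    up0 = a00 * s0 + a01 * s1       # message passed up to the root in state 0
    up1 = a10 * s0 + a11 * s1       # message passed up to the root in state 1
    return pi0 * a00 * up0 + pi1 * a10 * up1


params = [Counted(Fraction(x)) for x in ("1/3", "2/3", "3/4", "1/4", "1/5", "4/5")]
for f in (phi000_expanded, phi000_factored):
    reset()
    result = f(*params)
    print(f.__name__, result.value, Counted.mults, Counted.adds)
```

The saving is modest for three leaves, but the same bottom-up reuse of partial sums is what keeps the computation tractable on larger trees.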

3. Some Models and Some Familiar Varieties 

Most evolutionary models discussed in the literature have either two or four states for 
their random variables. The number n of leaves (or taxa) can be arbitrary. Computer 
scientists will often concentrate on asymptotic complexity questions for n → ∞, while
for our purposes it would be quite reasonable to assume that n is at most ten. There are 
no general restrictions on the underlying tree T, but experience has shown that trivalent 
trees and trees in which every leaf is at the same distance from the root are often simpler. 
Suppose now that the number k of states, the number n of taxa and the tree T are
fixed. The choice of a model is then specified by fixing a subset $P \subset \mathbb{C}^N$. The set P
comprises the allowed model parameters. Here is a list of commonly studied models: 

General Markov: This is the model $P = \mathbb{C}^N$. All the transition matrices $M_e$
are pairwise distinct, and there are no constraints on the $k^2$ entries of $M_e$. The
algebraic geometry of this model was studied by Allman and Rhodes ^Ej- 

Group Based: The matrices $M_e$ are pairwise distinct, but they all have a special
structure which makes them simultaneously diagonalizable by the Fourier transform
of an abelian group. In particular, P is a linear subspace of $\mathbb{C}^N$, specified by
requiring that some entries of $M_e$ coincide with some other entries. For example,
the Jukes-Cantor model for binary states (k = 2) stipulates that all matrices $M_e$

have the form $\begin{pmatrix} a_0 & a_1 \\ a_1 & a_0 \end{pmatrix}$. The Jukes-Cantor model for DNA (k = 4) is the topic

of the next section. For more information on group-based models see ^01 1221 121] • 

Stationary Base Composition: The matrices $M_e$ are distinct but they all share
the common left eigenvector $\pi = (\pi_1, \ldots, \pi_k)$. This hypothesis expresses the assumption
that the distribution of the four nucleotides remains the same throughout
some evolutionary process. An algebraic study of this model appears in [3]. 

Reversible: The matrices Me are distinct symmetric matrices with the common left 
eigenvector $\pi = (1, 1, \ldots, 1)$. Again, as before, P is a linear subspace of $\mathbb{C}^N$.

Commuting: The matrices Me are distinct but they commute pairwise. We have 
not yet seen this model in the biology literature, but algebraists love the commut- 
ing variety ^^HP]. It provides a natural supermodel for the next one. 

Substitution: The matrices $M_e$ have the form $\exp(t_e \cdot Q)$ where Q is a fixed matrix.
Equivalently, all matrices $M_e$ are powers of a fixed matrix $A = \exp(Q)$ with
constant entries, but where the exponent $t_e$ is a real parameter. This is the most
widely used model in biology, but for us it has the disadvantage that it is
not an algebraic variety, unless the rate matrix Q has commensurate eigenvalues. 

Homogeneous: The matrices Me are all equal, or they all belong to a small finite 
collection. In this model, the number of free parameters is small and independent 
of the tree, so the parametric inference algorithm of jTHj runs in polynomial time. 

No Hidden Nodes: When all nodes are observed random variables then the pa- 
rameterization becomes monomial, and the model is a toric variety. For the ho- 
mogeneous model, the combinatorial structure of this variety was studied in jH| 

Mixture models: Suppose we are given m trees $T_1, \ldots, T_m$ (not necessarily distinct)
on the same set of taxa. Each tree $T_i$ has its own map $\phi_i : \mathbb{C}^N \to \mathbb{C}^{k^n}$. The
mixture model is given by the sum of these maps, that is, $\phi = \phi_1 + \cdots + \phi_m$. For
example, the case $T_1 = T_2 = \cdots = T_m$ and k = 4 may be used to model the fact
that different regions of the genome evolve at different rates. See [12, 13].

Root distribution: For any of the above models, the root distribution $\pi$ can be
taken to be uniform, $\pi = (1, 1, \ldots, 1)$, or as a vector with k independent entries.

Among these models are many varieties which are familiar in algebraic geometry. 



Segre Varieties: These appear as a special case of the model with no hidden nodes. 
Veronese Varieties: These appear as a special case of the homogeneous model with
no hidden nodes. The models in jS] are natural projections of Veronese varieties. 

Toric Varieties: The previous two classes of varieties are toric. All group-based 
models are seen to be toric after a clever linear change of coordinates. The toric varieties
of some Jukes-Cantor models will be discussed in the next section. Gröbner
bases of binomials for arbitrary group-based models are given in [21].

Secant Varieties and Joins: Joins appear when taking the mixture models of a
collection of models. Taking the secant variety of a model amounts to taking the mixture
of a model with itself. A special case of the general Markov model includes the
secant varieties to the Segre varieties [1]. The secant varieties to Veronese varieties
[3] appear as special cases of the homogeneous models with hidden nodes.

Determinantal Varieties: Many of the evolutionary models are naturally embed- 
ded in determinantal varieties, because the tree structure imposes rank constraints 
on matrices derived from the probabilities observed at the leaves. Getting a better 
understanding of these constraints is important for both theory and practice [U]. 

The remainder of this section is the discussion of one example which aims to demon- 
strate that phylogenetic trees arise quite naturally when studying these classical objects 
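of algebraic geometry. Consider the Segre embedding of $\mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1$ in $\mathbb{P}^{15}$. This
four-dimensional complex manifold is given by the familiar monomial parameterization

$$p_{ijkl} = u_i \cdot v_j \cdot w_k \cdot x_l, \qquad i, j, k, l \in \{0, 1\}.$$

Its prime ideal is generated by the 2 x 2-minors of the following three 4 x 4-matrices:

$$\begin{pmatrix}
p_{0000} & p_{0001} & p_{0010} & p_{0011} \\
p_{0100} & p_{0101} & p_{0110} & p_{0111} \\
p_{1000} & p_{1001} & p_{1010} & p_{1011} \\
p_{1100} & p_{1101} & p_{1110} & p_{1111}
\end{pmatrix}
\quad
\begin{pmatrix}
p_{0000} & p_{0001} & p_{0100} & p_{0101} \\
p_{0010} & p_{0011} & p_{0110} & p_{0111} \\
p_{1000} & p_{1001} & p_{1100} & p_{1101} \\
p_{1010} & p_{1011} & p_{1110} & p_{1111}
\end{pmatrix}
\quad
\begin{pmatrix}
p_{0000} & p_{0010} & p_{0100} & p_{0110} \\
p_{0001} & p_{0011} & p_{0101} & p_{0111} \\
p_{1000} & p_{1010} & p_{1100} & p_{1110} \\
p_{1001} & p_{1011} & p_{1101} & p_{1111}
\end{pmatrix}$$

These three matrices reflect the following three bracketings of the parameterization:

$$p_{ijkl} = ((u_i \cdot v_j) \cdot (w_k \cdot x_l)) = ((u_i \cdot w_k) \cdot (v_j \cdot x_l)) = ((u_i \cdot x_l) \cdot (v_j \cdot w_k)).$$

And, of course, these three bracketings correspond to the three binary trees below.

[Figure: three binary trees on the leaves u, v, w, x, pairing the leaves as (u v)(w x), (u w)(v x), and (u x)(v w).]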



Let X denote the first secant variety of the Segre variety $\mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1$. Thus X
is the nine-dimensional irreducible subvariety of $\mathbb{P}^{15}$ consisting of all 2 x 2 x 2 x 2-tensors
which have tensor rank at most 2. The secant variety X has the parametric representation

$$p_{ijkl} = \pi_0 \cdot u_{0i} \cdot v_{0j} \cdot w_{0k} \cdot x_{0l} + \pi_1 \cdot u_{1i} \cdot v_{1j} \cdot w_{1k} \cdot x_{1l}.$$

This shows that the secant variety X equals the general Markov model for the tree below.

[Figure: tree with root π and leaves 1, 2, 3, 4.]

The prime ideal of X is generated by all the 3 x 3-minors of the three matrices above.
We write $X_{(12)(34)}$ for the variety defined by the 3 x 3-minors of the leftmost matrix,
$X_{(13)(24)}$ for the variety of the 3 x 3-minors of the middle matrix, and $X_{(14)(23)}$ for the
variety of the 3 x 3-minors of the rightmost matrix. Then we have, scheme-theoretically,

(3.1)   $X = X_{(12)(34)} \cap X_{(13)(24)} \cap X_{(14)(23)}$.

These three varieties are the general Markov models for the three binary trees depicted 
above. For instance, the determinantal variety X(i2)(34) equals the general Markov model 
for the binary tree below. 




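[Figure: binary tree with leaves 1, 2, 3, 4.]

Indeed, the standard parameterization of this model equals

$$\begin{aligned}
p_{ijkl} ={}& \pi_0 \cdot (a_{00} u_{0i} v_{0j} + a_{01} u_{1i} v_{1j}) \cdot (b_{00} w_{0k} x_{0l} + b_{01} w_{1k} x_{1l}) \\
&+ \pi_1 \cdot (a_{10} u_{0i} v_{0j} + a_{11} u_{1i} v_{1j}) \cdot (b_{10} w_{0k} x_{0l} + b_{11} w_{1k} x_{1l}).
\end{aligned}$$

This representation shows that the leftmost 4 x 4 matrix has rank at most 2, and, conversely,
every 4 x 4 matrix of rank ≤ 2 can be written like this. We conclude that the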
general Markov model appears naturally when studying secant varieties of Segre varieties. 
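
The rank statements of this section are easy to test on sample points. The Python sketch below is an illustration with our own helper names (`rank`, `flatten`, `segre_point`, `ranks`) and arbitrary sample vectors: it forms the three 4 x 4 flattenings of a 2 x 2 x 2 x 2 tensor and computes their ranks exactly. A point of the Segre variety should give rank 1 in all three flattenings, and a sum of two such points should give rank at most 2.

```python
from fractions import Fraction
from itertools import product


def rank(matrix):
    """Rank of a matrix of Fractions, by Gaussian elimination."""
    rows = [list(r) for r in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


def flatten(p, row_pos, col_pos):
    """4 x 4 flattening of a tensor p (dict keyed by (i, j, k, l))."""
    mat = []
    for r in product(range(2), repeat=2):
        row = []
        for c in product(range(2), repeat=2):
            idx = [0] * 4
            idx[row_pos[0]], idx[row_pos[1]] = r
            idx[col_pos[0]], idx[col_pos[1]] = c
            row.append(p[tuple(idx)])
        mat.append(row)
    return mat


# The three splits of the four leaves into two pairs (positions are 0-based).
SPLITS = {"12|34": ((0, 1), (2, 3)), "13|24": ((0, 2), (1, 3)), "14|23": ((0, 3), (1, 2))}


def segre_point(u, v, w, x):
    """Rank-one tensor p_ijkl = u_i v_j w_k x_l."""
    return {(i, j, k, l): Fraction(u[i] * v[j] * w[k] * x[l])
            for i, j, k, l in product(range(2), repeat=4)}


def ranks(p):
    return {name: rank(flatten(p, rows, cols)) for name, (rows, cols) in SPLITS.items()}


p1 = segre_point((1, 2), (1, 3), (2, 1), (1, 1))
p2 = segre_point((1, 1), (2, 1), (1, 3), (1, 2))
secant_point = {key: p1[key] + p2[key] for key in p1}

print(ranks(p1))
print(ranks(secant_point))
```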

It is instructive to redo the above calculations under the assumption u = v = w = x. 
Then the ambient $\mathbb{P}^{15}$ gets replaced by the four-dimensional space with coordinates

$$\begin{aligned}
p_0 &= p_{0000} \\
p_1 &= p_{0001} = p_{0010} = p_{0100} = p_{1000} \\
p_2 &= p_{0011} = p_{0101} = p_{0110} = p_{1001} = p_{1010} = p_{1100} \\
p_3 &= p_{0111} = p_{1011} = p_{1101} = p_{1110} \\
p_4 &= p_{1111}
\end{aligned}$$

Under these substitutions, all three 4 x 4-matrices reduce to the same 3 x 3-matrix

$$\begin{pmatrix} p_0 & p_1 & p_2 \\ p_1 & p_2 & p_3 \\ p_2 & p_3 & p_4 \end{pmatrix}$$

The ideal of 2 x 2-minors now defines the rational normal curve of degree four. This
special Veronese variety is the small diagonal of the Segre variety $\mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \subset \mathbb{P}^{15}$.
The secant variety of the rational normal curve is the cubic hypersurface in $\mathbb{P}^4$ defined by
the determinant of the 3 x 3 matrix. Hence, unlike (3.1), the homogeneous model satisfies

(3.2)   $X = X_{(12)(34)} = X_{(13)(24)} = X_{(14)(23)}$.
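
The statements about the rational normal curve can be checked on integer sample points. In the Python sketch below, which is an illustration with our own function names (`curve_point`, `hankel`, `det3`, `minors2`), a point of the curve is $(s^4, s^3 t, s^2 t^2, s t^3, t^4)$. All 2 x 2-minors of its 3 x 3 matrix vanish, the determinant vanishes on any sum of two curve points, and it is generally nonzero on a sum of three.

```python
from itertools import combinations


def curve_point(s, t):
    """Point (p0, ..., p4) of the rational normal curve of degree four."""
    return [s ** (4 - i) * t ** i for i in range(5)]


def hankel(p):
    """The 3 x 3 matrix with entries p_{r+c}."""
    return [[p[r + c] for c in range(3)] for r in range(3)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def minors2(m):
    """All 2 x 2 minors of a 3 x 3 matrix."""
    return [m[r1][c1] * m[r2][c2] - m[r1][c2] * m[r2][c1]
            for r1, r2 in combinations(range(3), 2)
            for c1, c2 in combinations(range(3), 2)]


def add_points(*points):
    return [sum(coords) for coords in zip(*points)]


on_curve = curve_point(2, 3)
on_secant = add_points(curve_point(1, 1), curve_point(1, 2))
generic = add_points(curve_point(1, 1), curve_point(1, 2), curve_point(1, 3))

print(minors2(hankel(on_curve)))
print(det3(hankel(on_secant)))
print(det3(hankel(generic)))
```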

Studying the stratifications of the ambient projective space induced by phylogenetic models, such as (3.1) and
(3.2), will be one of the open problems to be presented in Section 5. First, however, let
us look at some widely used models which give rise to a nice family of toric varieties. 

4. The Jukes-Cantor model 

The Jukes-Cantor model appears frequently in the computational biology literature 
and represents a family of toric varieties which have the unusual property that they are 
not toric varieties in their natural coordinate system. Furthermore, while at first glance 
they sit naturally inside of $\mathbb{P}^{4^n - 1}$, the linear span of these models involves many fewer
coordinates. In this section, we will present examples of these phenomena, as well as 
illustrate some open problems about the underlying varieties. 

Example 4.1. Let T be the tree with 3 leaves below. 




1 2 3 



We consider the Jukes-Cantor DNA model of evolution, where each random variable 
has 4 states (the nucleotide bases A,C,G,T) and the root distribution is uniform, i.e., 
$\pi = (1/4, 1/4, 1/4, 1/4)$. The transition matrices for the Jukes-Cantor DNA model have
the form

$$M_a = \begin{pmatrix}
a_0 & a_1 & a_1 & a_1 \\
a_1 & a_0 & a_1 & a_1 \\
a_1 & a_1 & a_0 & a_1 \\
a_1 & a_1 & a_1 & a_0
\end{pmatrix}.$$

The transition matrices $M_b$, $M_c$, and $M_d$ are expressed in the same form as $M_a$
with "a" replaced by b, c, and d respectively. From these matrices and the rooted tree T,
we get the map

$$\phi : \mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \times \mathbb{P}^1 \to \mathbb{P}^{63}$$

where the coordinates of $\mathbb{P}^{63}$ are the possible DNA bases at the leaves. For example,

$$p_{AAA} = \frac{1}{4}\,(a_0 b_0 c_0 d_0 + 3 a_1 b_1 c_0 d_0 + 3 a_1 b_0 c_1 d_1 + 3 a_0 b_1 c_1 d_1 + 6 a_1 b_1 c_1 d_1).$$



That is, $p_{AAA}$ is the probability of observing the triple AAA at the leaves of the tree.
Since this parameterization is symmetric under renaming the bases, there are many linear
relations.

$$\begin{array}{ll}
p_{AAA} = p_{CCC} = p_{GGG} = p_{TTT} & \text{4 terms} \\
p_{AAC} = p_{AAG} = p_{AAT} = \cdots = p_{TTG} & \text{12 terms} \\
p_{ACA} = p_{AGA} = p_{ATA} = \cdots = p_{TGT} & \text{12 terms} \\
p_{CAA} = p_{GAA} = p_{TAA} = \cdots = p_{GTT} & \text{12 terms} \\
p_{ACG} = p_{ACT} = p_{AGT} = \cdots = p_{CGT} & \text{24 terms}
\end{array}$$



We are left with 5 distinct coordinates. From the practical standpoint, one is often 
interested in the accumulated coordinates, which are given parametrically as follows 

$$\begin{aligned}
p_{123} &= e_0 c_0 d_0 + 3 e_1 c_1 d_1 \\
p_{12} &= 3 e_0 c_0 d_1 + 3 e_1 c_1 d_0 + 6 e_1 c_1 d_1 \\
p_{13} &= 3 e_0 c_1 d_0 + 3 e_1 c_0 d_1 + 6 e_1 c_1 d_1 \\
p_{23} &= 3 e_1 c_0 d_0 + 3 e_0 c_1 d_1 + 6 e_1 c_1 d_1 \\
p_{dis} &= 6 e_1 c_1 d_0 + 6 e_1 c_0 d_1 + 6 e_0 c_1 d_1 + 6 e_1 c_1 d_1
\end{aligned}$$

where $e_0 = a_0 b_0 + 3 a_1 b_1$ and $e_1 = a_0 b_1 + a_1 b_0 + 2 a_1 b_1$. Interpreting these coordinates in
terms of the probabilistic model: $p_{123}$ is the probability of seeing the same base at all
three leaves, $p_{ij}$ is the probability of seeing the same base at leaves i and j and a different
base at leaf k, and $p_{dis}$ is the probability of seeing distinct bases at the three leaves.

Note that the image of $\phi$ is a three dimensional projective variety. This is a consequence
of the uniform root distribution in this model. The fiber over a generic point is isomorphic
to $\mathbb{P}^1$ and stems from the fact that it is not possible to individually determine the matrices
$M_a$ and $M_b$. Only the product $M_a M_b$ can be determined. It is easily computed that the
vanishing ideal of this model is generated by one cubic with 19 terms.
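
The accumulated coordinates can be checked by brute force over all 64 leaf patterns. The Python sketch below is an illustration with our own function names. Since the tree picture is not reproduced here, it assumes the same tree shape as in Section 2: edge a joins the root to leaf 1, edge b joins the root to the interior node, and edges c and d join the interior node to leaves 2 and 3. Under that assumption it compares the enumerated sums with the closed forms in terms of $e_0$ and $e_1$.

```python
from fractions import Fraction
from itertools import product


def jc_matrix(x0, x1):
    """4 x 4 Jukes-Cantor matrix: x0 on the diagonal, x1 elsewhere."""
    return [[x0 if r == s else x1 for s in range(4)] for r in range(4)]


def accumulated_brute_force(a, b, c, d):
    """Sum pattern probabilities into the five classes 123, 12, 13, 23, dis."""
    Ma, Mb, Mc, Md = (jc_matrix(*x) for x in (a, b, c, d))
    acc = {"123": 0, "12": 0, "13": 0, "23": 0, "dis": 0}
    for s1, s2, s3 in product(range(4), repeat=3):
        # uniform root distribution; r is the root state, v the interior state
        prob = sum(Fraction(1, 4) * Ma[r][s1] * Mb[r][v] * Mc[v][s2] * Md[v][s3]
                   for r, v in product(range(4), repeat=2))
        if s1 == s2 == s3:
            acc["123"] += prob
        elif s1 == s2:
            acc["12"] += prob
        elif s1 == s3:
            acc["13"] += prob
        elif s2 == s3:
            acc["23"] += prob
        else:
            acc["dis"] += prob
    return acc


def accumulated_closed_form(a, b, c, d):
    """The five accumulated coordinates as stated in the text."""
    (a0, a1), (b0, b1), (c0, c1), (d0, d1) = a, b, c, d
    e0 = a0 * b0 + 3 * a1 * b1
    e1 = a0 * b1 + a1 * b0 + 2 * a1 * b1
    return {
        "123": e0 * c0 * d0 + 3 * e1 * c1 * d1,
        "12": 3 * e0 * c0 * d1 + 3 * e1 * c1 * d0 + 6 * e1 * c1 * d1,
        "13": 3 * e0 * c1 * d0 + 3 * e1 * c0 * d1 + 6 * e1 * c1 * d1,
        "23": 3 * e1 * c0 * d0 + 3 * e0 * c1 * d1 + 6 * e1 * c1 * d1,
        "dis": 6 * e1 * c1 * d0 + 6 * e1 * c0 * d1 + 6 * e0 * c1 * d1 + 6 * e1 * c1 * d1,
    }


# Sample stochastic parameters (x0 + 3 * x1 = 1 on every edge), chosen arbitrarily.
params = [(Fraction(7, 10), Fraction(1, 10)), (Fraction(2, 5), Fraction(1, 5)),
          (Fraction(1, 10), Fraction(3, 10)), (Fraction(17, 20), Fraction(1, 20))]

print(accumulated_brute_force(*params) == accumulated_closed_form(*params))
print(sum(accumulated_closed_form(*params).values()))
```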

10 N. ERIKSSON, K. RANESTAD, B. STURMFELS, AND S. SULLIVANT

Remarkably, there exists a linear change of coordinates so that this polynomial becomes
a binomial. Thus the corresponding variety is a toric variety in the new coordinates. This
change of coordinates is given by the Fourier transform, see [21] for details. In these
coordinates the parameterization factors:

$$\begin{aligned}
q_{0000} &= p_{123} + p_{12} + p_{13} + p_{23} + p_{dis} &&= (a_0 + 3a_1)(b_0 + 3b_1)(c_0 + 3c_1)(d_0 + 3d_1) \\
q_{0011} &= p_{123} - \tfrac{1}{3} p_{12} - \tfrac{1}{3} p_{13} + p_{23} - \tfrac{1}{3} p_{dis} &&= (a_0 + 3a_1)(b_0 + 3b_1)(c_0 - c_1)(d_0 - d_1) \\
q_{1101} &= p_{123} - \tfrac{1}{3} p_{12} + p_{13} - \tfrac{1}{3} p_{23} - \tfrac{1}{3} p_{dis} &&= (a_0 - a_1)(b_0 - b_1)(c_0 + 3c_1)(d_0 - d_1) \\
q_{1110} &= p_{123} + p_{12} - \tfrac{1}{3} p_{13} - \tfrac{1}{3} p_{23} - \tfrac{1}{3} p_{dis} &&= (a_0 - a_1)(b_0 - b_1)(c_0 - c_1)(d_0 + 3d_1) \\
q_{1111} &= p_{123} - \tfrac{1}{3} p_{12} - \tfrac{1}{3} p_{13} - \tfrac{1}{3} p_{23} + \tfrac{1}{3} p_{dis} &&= (a_0 - a_1)(b_0 - b_1)(c_0 - c_1)(d_0 - d_1)
\end{aligned}$$

In the Fourier coordinates, the cubic with nineteen terms becomes the binomial

$$q_{0011}\, q_{1110}\, q_{1101} - q_{0000}\, q_{1111}^2.$$

These Fourier coordinates are indexed by the subforests of the tree, where we define 
a subforest of a tree to be any subgraph of the tree (necessarily a forest), all of whose 
leaves are leaves of the original tree. For instance, the coordinate $q_{0000}$ corresponds to the
empty subforest, the coordinate $q_{1101}$ corresponds to the tree from leaf 1 to leaf 3 and not
including the edge to leaf 2, and the coordinate $q_{1111}$ corresponds to the full tree on three
leaves. In general there are $F_{2n-1}$ Fourier coordinates for a tree with n leaves, where
$F_m$ is the m-th Fibonacci number.
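
The factorization in Fourier coordinates, the binomial relation and the subforest count can all be verified mechanically. The Python sketch below is an illustration with our own function names. It starts from the closed forms of the accumulated coordinates, applies the linear change of coordinates, and compares with the product formula in which an edge contributes $x_0 + 3x_1$ when its index bit is 0 and $x_0 - x_1$ when it is 1. The edge lists used for counting subforests are assumptions: the three-leaf tree with edges a, b at the root and c, d at the interior node, and a balanced four-leaf tree.

```python
from fractions import Fraction


def accumulated(a, b, c, d):
    """Closed forms of (p123, p12, p13, p23, pdis) from the text."""
    (a0, a1), (b0, b1), (c0, c1), (d0, d1) = a, b, c, d
    e0 = a0 * b0 + 3 * a1 * b1
    e1 = a0 * b1 + a1 * b0 + 2 * a1 * b1
    p123 = e0 * c0 * d0 + 3 * e1 * c1 * d1
    p12 = 3 * e0 * c0 * d1 + 3 * e1 * c1 * d0 + 6 * e1 * c1 * d1
    p13 = 3 * e0 * c1 * d0 + 3 * e1 * c0 * d1 + 6 * e1 * c1 * d1
    p23 = 3 * e1 * c0 * d0 + 3 * e0 * c1 * d1 + 6 * e1 * c1 * d1
    pdis = 6 * e1 * c1 * d0 + 6 * e1 * c0 * d1 + 6 * e0 * c1 * d1 + 6 * e1 * c1 * d1
    return p123, p12, p13, p23, pdis


def fourier(p123, p12, p13, p23, pdis):
    """Linear change of coordinates to the five Fourier coordinates."""
    t = Fraction(1, 3)
    return {
        "0000": p123 + p12 + p13 + p23 + pdis,
        "0011": p123 - t * p12 - t * p13 + p23 - t * pdis,
        "1101": p123 - t * p12 + p13 - t * p23 - t * pdis,
        "1110": p123 + p12 - t * p13 - t * p23 - t * pdis,
        "1111": p123 - t * p12 - t * p13 - t * p23 + t * pdis,
    }


def factored(index, edges):
    """Product over edges of (x0 + 3 x1) for bit 0 and (x0 - x1) for bit 1."""
    value = 1
    for bit, (x0, x1) in zip(index, edges):
        value *= (x0 - x1) if bit == "1" else (x0 + 3 * x1)
    return value


def count_subforests(edges, leaves):
    """Count edge subsets in which every degree-one vertex is a leaf of the tree."""
    count = 0
    for mask in range(1 << len(edges)):
        degree = {}
        for bit, (x, y) in enumerate(edges):
            if mask >> bit & 1:
                degree[x] = degree.get(x, 0) + 1
                degree[y] = degree.get(y, 0) + 1
        if all(deg > 1 or node in leaves for node, deg in degree.items()):
            count += 1
    return count


def fibonacci(m):
    """m-th Fibonacci number with F_1 = F_2 = 1."""
    x, y = 1, 1
    for _ in range(m - 1):
        x, y = y, x + y
    return x


edges = [(2, 1), (3, 1), (5, 2), (7, 3)]   # arbitrary integer values for (x0, x1) on a, b, c, d
q = fourier(*accumulated(*edges))
print(all(q[idx] == factored(idx, edges) for idx in q))
print(q["0011"] * q["1110"] * q["1101"] - q["0000"] * q["1111"] ** 2)

tree3 = [("root", "1"), ("root", "v"), ("v", "2"), ("v", "3")]
print(count_subforests(tree3, {"1", "2", "3"}), fibonacci(5))
```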

Example 4.2. Now we consider an example of the Jukes-Cantor DNA model with uniform 
root distribution on the following tree T with 4 leaves. 




The variety of this model naturally lives in a $4^4 - 1 = 255$ dimensional projective space.
However, after noting the symmetry of the parameterization, as in the previous example,
there are only 15 coordinates in this model which are distinct. After applying the Fourier
transform, the parameterization factors into a product, and hence, the variety is naturally
described as a toric variety in $\mathbb{P}^{14}$. However, there are in fact 2 extra linear relations
which are not simply expressed as an equality of probabilities, so that our variety
sits most naturally inside a $\mathbb{P}^{12}$. Note that $13 = F_{2 \cdot 4 - 1}$, a Fibonacci number, as previously
mentioned. We will present the parameterization in these 13 Fourier coordinates.

Associated to each of the six edges in the tree is a matrix with two parameters ($a_0$ and
$a_1$, $b_0$ and $b_1$, etc.) as in the previous example. The Fourier transform is a linear change
of coordinates not only on the ambient space of the variety, but also on the parameter
space. The new parametric coordinates are given by

$$u_0 = a_0 + 3a_1, \quad u_1 = a_0 - a_1, \quad v_0 = b_0 + 3b_1, \quad v_1 = b_0 - b_1, \ldots$$

and so on down the alphabet. To each subforest of the 4 taxa tree T, there is a coordinate
$q_{ijklmn}$, where the index ijklmn is the indicator vector of the edges which appear in the
subforest.

The parameterization is given by the following rule

$$q_{ijklmn} \mapsto u_i \cdot v_j \cdot w_k \cdot x_l \cdot y_m \cdot z_n.$$

The ideal of phylogenetic invariants in the Fourier coordinates is generated by polynomials
of degrees two and three. The degree two invariants are given by the 2 x 2
determinants of the following matrices:

(4.1)   $\begin{pmatrix} q_{000000} & q_{000011} \\ q_{110000} & q_{110011} \end{pmatrix}$

The dimensions of these matrices are also Fibonacci numbers. The rows of these matrices
are indexed by the different edge configurations to the left of the root and the columns are
indexed by the edge configurations to the right of the root. There are also cubic invariants
which do not have nice determinantal representations. They come in two types:

$$q_{0000jk}\, q_{1111lm}\, q_{1111no} - q_{1100jk}\, q_{1011lm}\, q_{0111no}, \qquad
q_{jk0000}\, q_{lm1111}\, q_{no1111} - q_{jk0011}\, q_{lm1101}\, q_{no1110}.$$

The only condition on j, k, l, m, n, o is that each index is actually the indicator function
of a subforest of the tree. The variety of the Jukes-Cantor model on a 4 taxa tree has
dimension 5, so its secant variety is a proper subvariety in $\mathbb{P}^{12}$. In applications, the secant
varieties of the model are called mixture models. For this model, the secant variety has the
expected dimension 11, and so is a hypersurface. Since the matrix $M_1$ has rank 1 on the
original model, it must have rank 2 on the secant variety: thus, the desired hypersurface
is the 3 x 3 determinant of $M_1$.




Example 4.3 (Determinantal closure). Now consider the Jukes-Cantor DNA model with 
uniform root distribution on a binary tree with 5 leaves, as pictured: 
As in Example 4.1, the Fourier coordinates (modulo linear relations) are given by the
subforests of T, of which there are 34. In the Fourier coordinates, this ideal is generated
by binomials of degree 2 and 3, of the types we have seen in the previous example. While 
the cubic invariants have a relatively simple description, the quadratic invariants are 
all represented as the 2x2 determinants of matrices naturally associated to the tree. In 
particular, tools from numerical linear algebra can be used to determine if these invariants 
are satisfied. Since the degree 2 invariants are all determinantal, it seems natural to ask 
what algebraic set these determinantal relations cut out: that is, what is the determinantal 
closure of the variety of the Jukes-Cantor DNA model on a five taxa tree? The ideal of 
this determinantal closure is generated by the 2 x 2-minors of the four following matrices: 



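$$\begin{pmatrix}
q_{11001111} & q_{11000011} & q_{11001110} & q_{11001101} & q_{11000000} \\
q_{00001111} & q_{00000011} & q_{00001110} & q_{00001101} & q_{00000000}
\end{pmatrix}$$

$$\begin{pmatrix}
q_{10111000} & q_{10110101} & q_{10110110} & q_{10111011} & q_{10110111} & q_{10111101} & q_{10111110} & q_{10111111} \\
q_{11111000} & q_{11110101} & q_{11110110} & q_{11111011} & q_{11110111} & q_{11111101} & q_{11111110} & q_{11111111} \\
q_{01111000} & q_{01110101} & q_{01110110} & q_{01111011} & q_{01110111} & q_{01111101} & q_{01111110} & q_{01111111}
\end{pmatrix}$$

$$\begin{pmatrix}
q_{11111000} & q_{11000000} & q_{01111000} & q_{10111000} & q_{00000000} \\
q_{11111011} & q_{11000011} & q_{01111011} & q_{10111011} & q_{00000011}
\end{pmatrix}$$

$$\begin{pmatrix}
q_{00001101} & q_{10110101} & q_{01110101} & q_{11001101} & q_{11110101} & q_{10111101} & q_{01111101} & q_{11111101} \\
q_{00001111} & q_{10110111} & q_{01110111} & q_{11001111} & q_{11110111} & q_{10111111} & q_{01111111} & q_{11111111} \\
q_{00001110} & q_{10110110} & q_{01110110} & q_{11001110} & q_{11110110} & q_{10111110} & q_{01111110} & q_{11111110}
\end{pmatrix}$$
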
Surprisingly, this ideal is actually a prime ideal, and so the algebraic set is a toric variety. 
It has dimension 10 and degree 501, whereas this Jukes-Cantor model has only dimension 
7. How does the Jukes-Cantor model sit inside its determinantal closure? 

5. Problems 

The main problem in phylogenetic algebraic geometry is to understand the complex 
variety, i.e., the complex Zariski closure 



$$X_{\mathbb{C}} = \overline{\phi(P)},$$

of a phylogenetic model. This problem has many different reformulations, depending on 
the point of view of the person posing the problem. One problem posed by computational 
biologists jni Cni is to determine the "phylogenetic invariants" of the model. 

Problem 5.1 (Phylogenetic invariants). Find generators of the ideal defining $X_{\mathbb{C}}$.

Problem 5.2. Which equations or phylogenetic invariants are needed to distinguish be- 
tween different models? 

These problems are of particular interest for applications in phylogenetics, where one 
wishes to find which tree gives the evolutionary history of a set of taxa. Some more 
geometric problems are: 

Problem 5.3. What are the basic geometric invariants of $\phi$ and $X_{\mathbb{C}}$ for the various models?

• What is the dimension of $X_{\mathbb{C}}$?

• If $\phi$ is generically finite, what is the generic degree?

• What is the degree of $X_{\mathbb{C}}$?

• What is the base locus or indeterminacy locus of $\phi$?

• What is the singular locus of $X_{\mathbb{C}}$?

Problem 5.4. Fix a type of model with k states and a number n of leaves (or taxa).
Consider the set of rooted trees with n leaves and the corresponding arrangement A of
varieties in $\mathbb{C}^{k^n}$. Describe the stratification of A, where two points in A are in the same
stratum if they are contained in the intersection of the same models. Is the stratification of
A the same as the stratification of the space of phylogenetic trees (cf. [3])?

The tropicalization of a variety is the "logarithmic limit set" of the points on the 
complex variety. Tropical geometry is the geometry of the min-plus semi-ring. It was 
shown in jTH| that the tropical geometry of statistical models plays a crucial role in 
parametric inference. 

Problem 5.5. Determine the combinatorial structure of the tropicalizations of the various 
models of evolution. In particular, work out parametric inference for the substitution 
model. 

Problem 5.6. How does the tropicalization of a mixture model relate to the tropical mix- 
ture of the tropicalization of the model: that is, compare the tropicalization of secant 
varieties and joins to the secant varieties and joins of tropicalizations, see [Zj. 

In practice, it has proven to be difficult to find a full set of generators of the ideal of
$X_{\mathbb{C}}$. Therefore, we suggest certain subsets of the ideal that may be enough to distinguish
between different models (as Problem 5.2 asks). We think of these subsets as types of
closure operation, for example, Xc is the Zariski closure (over C) of X^. We suggest the 
following closures as possibly easier to find and use: 

Linear closure: the linear span of $X_{\mathbb{C}}$. For work on this problem, see [22].

Quadratic closure: defined by the quadratic generators of the ideal. This is closely 
related to the conditional independence closure from algebraic statistics, which is 
defined by determinantal quadratic generators, i.e., quadrics of rank 4. 

Determinantal closure: defined by the determinantal polynomials in the ideal. 
For example, there is a large set of determinantal relations that hold for any of the 
models defined above. In practice, having large sets of determinantal generators 
of the ideal is convenient, as determinantal conditions can be effectively evaluated 
using numerical linear algebra, see |n|. 

Local closures: defined by invariants that each depend only on subtrees of T. Often
these give all the invariants for a model, e.g., [24].

Orbit closures: applicable if the parameter space has a dense orbit under some 
group and is equivariant. Possible related objects are quiver varieties and hy- 
perdeterminants, see 

Note that part of the difficulty of studying these closure operations is coming up with a 
good definition for them. 

Problem 5.7. Study the stratifications induced by the union of the set of "closures" of 
these varieties for a given model with fixed numbers of leaves (or taxa). 
From these rather general problems we turn to more specific, computationally-oriented
problems in phylogenetic algebraic geometry. Many of them are special cases of the 
general problems above and are concrete starting points for attempting to resolve these 
more general problems. They also serve as an introduction to the complexity that can 
arise. 

Problem 5.8. Consider a tree T with n leaves and consider the subvariety of $(\mathbb{C}^2)^{\otimes n}$
consisting of all 2 x 2 x ··· x 2-tables P such that all flattenings of P along edges that
split T have rank at most r. Is this variety irreducible? Do the determinants define a
reduced scheme? What is the dimension of this variety? 

Problem 5.9. Consider the general Markov model on a non-binary tree T with 6 leaves. 
Is the variety $X_{\mathbb{C}}$ equal to the intersection of all models from binary trees on 6 leaves
which are refinements of T? If the answer is yes, does the same statement hold scheme- 
theoretically? 

Problem 5.10. Given two trees T and T' on the same number of taxa, what are the 
irreducible components of the intersections of their corresponding varieties? 

Problem 5.11. For all trees with at most eight leaves, compute a basis for the space of 
linear invariants of the homogeneous Markov model, with and without hidden nodes. 
What about quadratic invariants? 

Problem 5.12. What is the dimension of the Zariski closure of the substitution model? 
Problem 5.13. Classify all phylogenetic models that are smooth. 

Problem 5.14. Compute the phylogenetic complexity of the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ (see [24, Conjecture 28]).

Problem 5.15. Study the secant varieties of the Jukes-Cantor binary model for all trees 
with at most six leaves. Do any of them fail to have the expected dimension? When do 
determinantal conditions suffice to describe these models? 

Problem 5.16. Let T be the balanced binary tree on four leaves. Compute the Newton 
polytope (as defined in ^H]) of the homogeneous model for DNA sequences. 